Datasets¶

A Dataset is a specialization of a Resource that provides users with operations to handle files, record their provenance and describe them with metadata.

In [ ]:

from kgforge.core import KnowledgeGraphForge

A configuration file is needed in order to create a KnowledgeGraphForge session. A configuration can be generated using the notebook 00-Initialization.ipynb.

Note: DemoStore doesn't implement file operations yet. Use the BlueBrainNexus store instead when creating a configuration file.

In [ ]:

forge = KnowledgeGraphForge("../../configurations/forge.yml")

Imports¶

In [ ]:

from kgforge.core import Resource

In [ ]:

from kgforge.specializations.resources import Dataset

In [ ]:

import pandas as pd

Creation with files added as parts¶

In [ ]:

! ls -p ../../data | grep -v '/$'

In [ ]:

persons = Dataset(forge, name="Interesting Persons")

In [ ]:

persons.add_files("../../data/persons.csv")

In [ ]:

forge.register(persons)

In [ ]:

forge.as_json(persons)

In [ ]:

associations = Dataset(forge, name="Associations data")

In [ ]:

associations.add_files("../../data/associations.tsv")

In [ ]:

associations.add_derivation(persons)

In [ ]:

forge.register(associations)

In [ ]:

forge.as_json(associations)

In [ ]:

# By default, files are downloaded to the current directory (path=".").
# Set "follow" to collect the URLs of the files to download from a different JSON path,
# and set "path" to download the files to a different directory.
# overwrite (bool): if True, existing files with the same name are overwritten;
# if False, new files are created with a timestamp suffix in their names.
# cross_bucket (bool): if False (the default), data is downloaded from the configured bucket;
# if True, data is downloaded from a bucket other than the configured one.
# The configured store must support crossing buckets for this to work.
associations.download(source="parts")
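The overwrite behaviour described in the comments above can be pictured with a small standalone sketch. It is not kgforge's implementation: `resolve_target` is a hypothetical helper, and the exact timestamp format and its position in the file name are assumptions made for illustration only.

```python
from datetime import datetime
from pathlib import Path
import tempfile


def resolve_target(directory: Path, filename: str, overwrite: bool) -> Path:
    """Return the path a downloaded file would be written to (hypothetical helper)."""
    target = directory / filename
    # Reuse the plain name when overwriting is allowed or nothing is in the way.
    if overwrite or not target.exists():
        return target
    # Otherwise keep the existing file and derive a new, timestamped name.
    # The timestamp format below is an assumption, not kgforge's actual format.
    stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
    return target.with_name(f"{target.stem}.{stamp}{target.suffix}")


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Simulate a file left behind by a previous download.
    (folder / "associations.tsv").write_text("existing content")
    replaced = resolve_target(folder, "associations.tsv", overwrite=True)
    kept = resolve_target(folder, "associations.tsv", overwrite=False)
    print(replaced.name)
    print(kept.name)
```

With `overwrite=True` the existing name is reused; with `overwrite=False` the existing file is left untouched and the new file gets a distinct name.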

In [ ]:

# A specific path can be provided.
associations.download(path="./downloaded/", source="parts")

In [ ]:

# A specific content type can be downloaded.
associations.download(path="./downloaded/", source="parts", content_type="text/tab-separated-values")
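Restricting a download to one content type amounts to filtering the file descriptions before fetching anything. The sketch below shows that idea on plain dictionaries. It assumes that each file description carries its content type under an `encodingFormat` key; `select_by_content_type` and the sample entries are hypothetical and are not part of kgforge.

```python
from typing import Dict, List, Optional

# Hypothetical file descriptions; "encodingFormat" is assumed to hold the content type.
distributions: List[Dict[str, str]] = [
    {"name": "associations.tsv", "encodingFormat": "text/tab-separated-values"},
    {"name": "persons.csv", "encodingFormat": "text/csv"},
]


def select_by_content_type(
    entries: List[Dict[str, str]], content_type: Optional[str] = None
) -> List[Dict[str, str]]:
    """Keep only the entries matching content_type; keep everything when it is None."""
    if content_type is None:
        return list(entries)
    return [entry for entry in entries if entry.get("encodingFormat") == content_type]


selected = select_by_content_type(distributions, "text/tab-separated-values")
print([entry["name"] for entry in selected])  # ['associations.tsv']
```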

In [ ]:

! ls -l ./downloaded

In [ ]:

# ! rm -R ./downloaded/

Creation with files added as distribution¶

In [ ]:

persons = Dataset(forge, name="Interesting Persons")

In [ ]:

persons.add_distribution("../../data/associations.tsv")

In [ ]:

persons.add_image(path='../../data/non_existing_person.jpg', content_type='image/jpeg', about='Person')

In [ ]:

forge.register(persons)

In [ ]:

forge.as_json(persons)

In [ ]:

# When files are added as distributions, they can be downloaded directly,
# without specifying which JSON path to use to collect the downloadable URLs.
# The content_type and path arguments can still be provided.
persons.download()
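The "JSON path" used to collect downloadable URLs can be thought of as a dotted path followed through the JSON form of a resource. The sketch below illustrates this on a plain dictionary. The `collect` helper, the sample record, and the two paths `distribution.contentUrl` and `hasPart.distribution.contentUrl` are assumptions made for illustration; they are not taken from kgforge's code.

```python
from typing import Any, List


def collect(node: Any, path: str) -> List[Any]:
    """Collect every value reached by following a dotted path, fanning out over lists."""
    if isinstance(node, list):
        # Apply the remaining path to each element of the list.
        return [value for item in node for value in collect(item, path)]
    if not path:
        # The whole path has been consumed: this node is a result.
        return [node]
    if not isinstance(node, dict):
        return []
    key, _, rest = path.partition(".")
    if key not in node:
        return []
    return collect(node[key], rest)


# Hypothetical JSON form of a dataset with one distribution and two parts.
record = {
    "name": "Interesting Persons",
    "distribution": {"contentUrl": "https://example.org/files/a"},
    "hasPart": [
        {"distribution": {"contentUrl": "https://example.org/files/b"}},
        {"distribution": {"contentUrl": "https://example.org/files/c"}},
    ],
}

print(collect(record, "distribution.contentUrl"))
print(collect(record, "hasPart.distribution.contentUrl"))
```

The first call reaches the file attached directly to the dataset; the second walks through each part before reaching its file.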

Creation with resources added as parts¶

In [ ]:

distribution_1 = forge.attach("../../data/associations.tsv")

In [ ]:

distribution_2 = forge.attach("../../data/persons.csv")

In [ ]:

jane = Resource(type="Person", name="Jane Doe", distribution=distribution_1)

In [ ]:

john = Resource(type="Person", name="John Smith", distribution=distribution_2)

In [ ]:

persons = [jane, john]

In [ ]:

forge.register(persons)

In [ ]:

dataset = Dataset(forge, name="Interesting people")

In [ ]:

dataset.add_parts(persons)

In [ ]:

forge.register(dataset)

In [ ]:

forge.as_json(dataset)

In [ ]:

dataset.download(path="./downloaded/", source="parts")

In [ ]:

! ls -l ./downloaded

In [ ]:

# ! rm -R ./downloaded/

Creation from resources converted to Dataset objects¶

In [ ]:

dataset = Dataset.from_resource(forge, [jane, john], store_metadata=True)
print(*dataset, sep="\n")

Creation from a dataframe¶

See notebook 07 DataFrame IO.ipynb for details on converting a pandas DataFrame into instances of Resource.

Basics¶

In [ ]:

dataframe = pd.read_csv("../../data/persons.csv")

In [ ]:

dataframe

In [ ]:

persons = forge.from_dataframe(dataframe)

In [ ]:

forge.register(persons)

In [ ]:

dataset = Dataset(forge, name="Interesting people")

In [ ]:

dataset.add_parts(persons)

In [ ]:

forge.register(dataset)

In [ ]:

forge.as_json(dataset)

Advanced¶

In [ ]:

dataframe = pd.read_csv("../../data/associations.tsv", sep="\t")

In [ ]:

dataframe

In [ ]:

dataframe["distribution"] = dataframe["distribution"].map(lambda x: forge.attach(x))

In [ ]:

associations = forge.from_dataframe(dataframe, na="(missing)", nesting="__")
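The `na` and `nesting` arguments are easier to picture on a single row. The sketch below is a plain-Python approximation, not kgforge's implementation. It assumes that a value equal to `na` is left out of the result and that the `nesting` separator in a column name marks a nested property. The `row_to_record` helper and the column names are hypothetical.

```python
from typing import Any, Dict


def row_to_record(row: Dict[str, Any], na: str, nesting: str) -> Dict[str, Any]:
    """Turn one flat row into a nested dictionary (hypothetical helper)."""
    record: Dict[str, Any] = {}
    for column, value in row.items():
        # Assumption: values equal to the missing-value marker are skipped.
        if value == na:
            continue
        # "agent__name" with nesting="__" becomes record["agent"]["name"].
        *parents, leaf = column.split(nesting)
        target = record
        for parent in parents:
            target = target.setdefault(parent, {})
        target[leaf] = value
    return record


# Hypothetical row, shaped like one line of a flattened table.
row = {
    "type": "Association",
    "agent__type": "Person",
    "agent__name": "Jane Doe",
    "agent__gender": "(missing)",
}

record = row_to_record(row, na="(missing)", nesting="__")
print(record)
# {'type': 'Association', 'agent': {'type': 'Person', 'name': 'Jane Doe'}}
```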

In [ ]:

print(*associations, sep="\n")

In [ ]:

forge.register(associations)

In [ ]:

dataset = Dataset(forge, name="Interesting associations")

In [ ]:

dataset.add_parts(associations)

In [ ]:

forge.register(dataset)

In [ ]:

forge.as_json(dataset)